
Conversation

@TianHao324 TianHao324 commented Jan 19, 2026

Summary

Add NPU support for the embedding kernel.

  • Implements a flattened, grid-stride Triton kernel for the embedding forward/backward passes to improve scalability and reduce launch overhead on Ascend NPUs (a simplified sketch follows below).
  • Uses UB-aware tiling (compute_default_tiling_strategy) and the NPU vector core count to dynamically select the block size and grid size for more stable performance.
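
As a rough illustration of the approach described above, here is a minimal sketch of a flattened, grid-stride embedding forward kernel. The names, signature, and block size are illustrative assumptions, not the exact code in this PR.

```python
# Minimal sketch of a flattened, grid-stride embedding forward kernel.
# Hypothetical names/signature for illustration only; not the PR's exact code.
import triton
import triton.language as tl


@triton.jit
def embedding_fwd_flat_sketch(
    embeddings_ptr,  # [num_embeddings, embedding_dim] weight table
    indices_ptr,     # [n_indices] token ids
    output_ptr,      # [n_indices * embedding_dim] flattened output
    total_elements,  # n_indices * embedding_dim
    embedding_dim: tl.constexpr,
    BLOCK_SIZE: tl.constexpr,
):
    pid = tl.program_id(0)
    num_programs = tl.num_programs(0)
    # Grid-stride loop: a fixed-size grid covers arbitrarily large inputs by
    # striding over the flattened output in steps of num_programs * BLOCK_SIZE.
    for start in range(pid * BLOCK_SIZE, total_elements, num_programs * BLOCK_SIZE):
        offs = start + tl.arange(0, BLOCK_SIZE)
        mask = offs < total_elements
        row = offs // embedding_dim  # which token
        col = offs % embedding_dim   # which feature within that token's row
        idx = tl.load(indices_ptr + row, mask=mask, other=0)
        val = tl.load(embeddings_ptr + idx * embedding_dim + col, mask=mask, other=0.0)
        tl.store(output_ptr + offs, val, mask=mask)
```

With a kernel of this shape, the grid can be fixed to the NPU vector core count and the block size chosen by the UB-aware tiling helper, so launch overhead stays roughly constant regardless of input size.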

Testing Done

I tested the embedding kernel with the following methods, and all cases passed:

  • python benchmark/scripts/benchmark_embedding.py
  • pytest -v test/transformers/test_embedding.py
  • Hardware Type: Ascend NPU 910B4
  • run make test to ensure correctness
  • run make checkstyle to ensure code style
  • run make test-convergence to ensure convergence

@TianHao324 (Contributor Author):

test_embedding result:
[screenshot of test results]

@TianHao324 (Contributor Author):

Hi @Tcc0403, could you please help me review my code?

@Tcc0403 Tcc0403 (Collaborator) left a comment

It seems the current implementation is quite inefficient. I've left comments about some possible issues it might have.

)


def get_optimal_block_size(total_elements, is_backward: bool):
Collaborator:

what does is_backward do?

Contributor Author:

Sorry, at first I intended to distinguish the forward and backward directions. Later, I realized their logic was quite similar and I forgot to delete it.

Comment on lines 14 to 20
@triton.jit
def embedding_forward_kernel(
embeddings_ptr,
indices_ptr,
output_ptr,
total_elements,
n_elements,
embedding_dim: tl.constexpr,
BLOCK_SIZE: tl.constexpr,
NUM_STAGES: tl.constexpr,
):
Collaborator:

I think the original implementation with 2 block sizes for the tile shape is more readable and more efficient.

A persistent grid loop is fine, but the way this kernel loads the embedding seems to be uncoalesced at some point.

Collaborator:

For instance, some dim_idx values will not be consecutive if BLOCK_SIZE is not a multiple of embedding_dim. That makes the second tl.load access different rows within a warp, and the same applies to the last store.

Creating these offsets with a 2D block size is more readable and more efficient, since it avoids the uncoalesced access mentioned above.
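
To make the coalescing point concrete, here is a hypothetical sketch of offsets built from two block sizes (rows x columns), so each tl.load/tl.store touches a contiguous run of columns within a single row. The names and the column-tiling loop are assumptions for illustration, not the PR's actual code.

```python
# Hypothetical 2D-offset construction (illustrative only, not the PR's code).
import triton
import triton.language as tl


@triton.jit
def embedding_fwd_2d_sketch(
    embeddings_ptr,
    indices_ptr,
    output_ptr,
    n_indices,
    embedding_dim,
    BLOCK_ROWS: tl.constexpr,  # tokens handled per program
    BLOCK_COLS: tl.constexpr,  # features handled per inner tile
):
    pid = tl.program_id(0)
    row_offs = pid * BLOCK_ROWS + tl.arange(0, BLOCK_ROWS)
    row_mask = row_offs < n_indices
    idx = tl.load(indices_ptr + row_offs, mask=row_mask, other=0)

    # Walk the embedding dimension in column tiles; the last dimension of each
    # tile is contiguous in memory, so loads and stores remain coalesced.
    for col_start in range(0, embedding_dim, BLOCK_COLS):
        col_offs = col_start + tl.arange(0, BLOCK_COLS)
        col_mask = col_offs < embedding_dim
        mask = row_mask[:, None] & col_mask[None, :]
        src = embeddings_ptr + idx[:, None] * embedding_dim + col_offs[None, :]
        dst = output_ptr + row_offs[:, None] * embedding_dim + col_offs[None, :]
        tl.store(dst, tl.load(src, mask=mask, other=0.0), mask=mask)
```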

Contributor Author:

I have changed it to a 2D block. After testing, it indeed shows much better performance. The issues mentioned below have also been fixed. Could you please review it again?

Comment on lines 110 to 126
tile_shapes = compute_default_tiling_strategy(
safety_margin=0.9, dtype_size=4, memory_multiplier=multiplier, shapes=((total_elements,),), tiling_dims=(0,)
)
Collaborator:

dtype_size should be embedding.dtype?

Contributor Author:

Modified

block_size = tile_shapes[0][0]
return block_size
else:
return triton.next_power_of_2(total_elements)
Collaborator:

I think the fallback value should be a workable one; triton.next_power_of_2(total_elements) is too large.
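
For reference, the kind of bounded fallback being suggested could look something like the sketch below; the cap value is an assumption for illustration, not a value from the PR.

```python
import triton

# Hypothetical bounded fallback: cap the block size instead of using
# next_power_of_2(total_elements) directly, which can exceed available UB.
MAX_FALLBACK_BLOCK_SIZE = 8192  # assumed cap, not taken from the PR


def fallback_block_size(total_elements: int) -> int:
    return min(triton.next_power_of_2(total_elements), MAX_FALLBACK_BLOCK_SIZE)
```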

Contributor Author:

Modified

embeddings_ptr + embedding_offsets,
mask=final_mask,
other=0.0,
).to(tl.float32)
Collaborator:

Is there any particular consideration for why we need to upcast it?

Contributor Author:

Modified


Tcc0403 commented Jan 20, 2026

Could you attach the benchmark results for reference?


TianHao324 commented Jan 20, 2026

> Could you attach the benchmark results for reference?

Compared to the previous version, performance has improved by 4 to 5 times. However, there is still a significant gap compared to HuggingFace. I also tried the original GPU code (only addressing the UB issue), and its performance was nearly the same (the results are shown below).

[
  {
    "kernel_name": "embedding",
    "kernel_provider": "liger",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "V",
    "x_label": "embedding dimension",
    "x_values": [
      1024,
      2048,
      4096,
      8192,
      16384,
      32768,
      65536,
      131072
    ],
    "y_values_50": [
      42.66733932495117,
      43.84379959106445,
      43.834800720214844,
      43.53144836425781,
      43.65476989746094,
      42.79145050048828,
      44.18817138671875,
      44.12928009033203
    ],
    "y_values_20": [
      42.66537094116211,
      43.84306716918945,
      43.83445358276367,
      43.531349182128906,
      43.65372085571289,
      42.7907829284668,
      44.18741989135742,
      44.12871551513672
    ],
    "y_values_80": [
      42.669307708740234,
      43.84453201293945,
      43.835147857666016,
      43.531551361083984,
      43.655818939208984,
      42.792118072509766,
      44.18891906738281,
      44.129844665527344
    ],
    "timestamp": "2026-01-20 10:33:22",
    "kernel_operation_mode": "forward",
    "extra_benchmark_config_str": "{\"B\": 32, \"T\": 512, \"D\": 768, \"dtype\": \"torch.float32\"}",
    "liger_version": "0.0.0"
  },
  {
    "kernel_name": "embedding",
    "kernel_provider": "huggingface",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "V",
    "x_label": "embedding dimension",
    "x_values": [
      1024,
      2048,
      4096,
      8192,
      16384,
      32768,
      65536,
      131072
    ],
    "y_values_50": [
      0.08077999949455261,
      0.091559998691082,
      0.1134599968791008,
      0.14830000698566437,
      0.1863200068473816,
      0.21172000467777252,
      0.22543999552726746,
      0.2385600060224533
    ],
    "y_values_20": [
      0.08038800209760666,
      0.09114000201225281,
      0.11287999898195267,
      0.14771999418735504,
      0.18585199117660522,
      0.21121999621391296,
      0.22499999403953552,
      0.23792000114917755
    ],
    "y_values_80": [
      0.08191999793052673,
      0.09239999949932098,
      0.11416800320148468,
      0.14903999865055084,
      0.18700000643730164,
      0.21240000426769257,
      0.22592000663280487,
      0.23929999768733978
    ],
    "timestamp": "2026-01-20 10:33:35",
    "kernel_operation_mode": "forward",
    "extra_benchmark_config_str": "{\"B\": 32, \"T\": 512, \"D\": 768, \"dtype\": \"torch.float32\"}",
    "liger_version": "0.0.0"
  },

Implementation using the original GPU code:
[screenshot of benchmark results]

@Tcc0403 Tcc0403 (Collaborator) left a comment

I'm fine with merging this PR since it's an experimental operator and isn’t used in any patching path. That said, we should probably open a performance issue for this kernel and track it for future improvements.

@TianHao324 (Contributor Author):

> I'm fine with merging this PR since it's an experimental operator and isn’t used in any patching path. That said, we should probably open a performance issue for this kernel and track it for future improvements.

You're right. In fact, we do have plans to improve the performance. For now, we need to first support these operators on the NPU and then explore ways to optimize performance as much as possible.


Tcc0403 commented Jan 21, 2026

Could you open an issue with benchmarking results so we can track this performance problem and allow future contributors to work on it?

@TianHao324 (Contributor Author):

> Could you open an issue with benchmarking results so we can track this performance problem and allow future contributors to work on it?

Sure! #1036


Tcc0403 commented Jan 21, 2026

Thank you!

@Tcc0403 Tcc0403 merged commit 57e98d3 into linkedin:main Jan 21, 2026
3 of 7 checks passed